4.1 INTRODUCTION

The impacts of present and potential future climate change will be
one of the most important scientific and societal challenges in the 
21st century. Given observed changes in temperature, sea ice, and sea 
level, improving our understanding of the climate system is an interna- 
tional priority. This system is characterized by complex phenomena that 
are imperfectly observed and even more imperfectly simulated. But with 
an ever-growing supply of climate data from satellites and environmental 
sensors, the magnitude of data and climate model output is beginning
to overwhelm the relatively simple tools currently used to analyze them. 
A computational approach will therefore be indispensable for these 
analysis challenges. This chapter introduces the fledgling research discipline of
climate informatics: collaborations between climate scientists and
machine learning researchers in order to bridge this gap between data and 
understanding. We hope that the study of climate informatics will acceler- 
ate discovery in answering pressing questions in climate science. 

Machine learning is an active research area at the interface of com- 
puter science and statistics, concerned with developing automated tech- 
niques, or algorithms, to detect patterns in data. Machine learning (and 
data mining) algorithms are critical to a range of technologies, including 
Web search, recommendation systems, personalized Internet advertis- 
ing, computer vision, and natural language processing. Machine learning 
has also made significant impacts on the natural sciences, for example, in 
biology; the interdisciplinary field of bioinformatics has facilitated many 
discoveries in genomics and proteomics. The impact of machine learning 
on climate science promises to be similarly profound. 

The goal of this chapter is to define climate informatics and to propose 
some grand challenges for this nascent field. Recent progress on climate 
informatics, by the authors as well as by other groups, reveals that col- 
laborations with climate scientists also open up interesting new problems 
for machine learning. There are a myriad of collaborations possible at 
the intersection of these two fields. This chapter uses both top-down and 
bottom-up approaches to stimulate research progress on a range of prob- 
lems in climate informatics, some of which have yet to be proposed. For 
the former, we present challenge problems posed by climate scientists, and 
discussed with machine learning, data mining, and statistics researchers 
at Climate Informatics 2011, the First International Workshop on Climate 
Informatics, the inaugural event of a new annual workshop in which all 
co-authors participated. To spur innovation from the bottom-up, we also 
describe and discuss some of the types of data available. In addition to 
summarizing some of the key challenges for climate informatics, this 
chapter also draws on some of the recent climate informatics research of 
the co-authors. 

The chapter is organized as follows. First, we discuss the types of cli- 
mate data available and outline some challenges for climate informatics, 
including problems in analyzing climate data. Then we go into further 
detail on several key climate informatics problems: seasonal climate 
forecasting, predicting climate extremes, reconstructing past climate, and
some problems in polar regions. We then discuss some machine learning 
and statistical approaches that might prove promising (and that were not 
mentioned in previous sections). Finally, we discuss some challenges and 
opportunities for climate science data and data management. Due to the 
broad coverage of the chapter, related work discussions are interspersed 
throughout the sections. 

4.2 MACHINE LEARNING 

Over the past few decades, the field of machine learning has matured
significantly, drawing on ideas from several disciplines, including optimization,
statistics, and artificial intelligence [4, 34]. Application of machine
learning has led to important advances in a wide variety of domains rang- 
ing from Internet applications to scientific problems. Machine learning 
methods have been developed for a wide variety of predictive modeling 
as well as exploratory data analysis problems. In the context of predictive 
modeling, important advances have been made in linear classification and 
regression, hierarchical linear models, nonlinear models based on kernels, 
as well as ensemble methods that combine outputs from different predic- 
tors. In the context of exploratory data analysis, advances have been made 
in clustering and dimensionality reduction, including nonlinear methods 
to detect low-dimensional manifold structures in the data. Some of the 
important themes driving research in modern machine learning are moti- 
vated by properties of modern datasets from scientific, societal, and com- 
mercial applications. In particular, the datasets are extremely large scale, 
running into millions or billions of data points; are high-dimensional, 
going up to tens of thousands or more dimensions; and have intricate 
statistical dependencies that violate the “independent and identically dis- 
tributed” assumption made in traditional approaches. Such properties 
are readily observed in climate datasets, including observations, reanaly- 
sis, as well as climate model outputs. These aspects have led to increased 
emphasis on scalable optimization methods [94], online learning methods 
[11], and graphical models [47], which can handle large-scale data in high 
dimensions with statistical dependencies. 

4.3 UNDERSTANDING AND USING CLIMATE DATA 

Profuse amounts of climate data of various types are available, providing 
a rich and fertile playground for future data mining and machine learn- 
ing research. Here we discuss some of the varieties of data available, and 
provide some suggestions on how they can be used. The discussion opens
up some interesting problems. There are multiple sources of climate data, 
ranging from single-site observations scattered in an unstructured way 
across the globe to climate model output that is global and uniformly 
gridded. Each class of data has particular characteristics that should be 
appreciated before it can be successfully used or compared. We provide 
here a brief introduction to each, with a few examples and references for 
further information. Common issues that arise in cross-class syntheses 
are also addressed. 

4.3.1 In-Situ Observations 

In-situ measurements refer to raw (or only minimally processed) measure- 
ments of diverse climate system properties that can include temperatures, 
rainfall, winds, column ozone, cloud cover, radiation, etc., taken from spe- 
cific locations. These locations are often at the surface (e.g., from weather 
stations), but can also include atmospheric measurements from radio- 
sonde balloons, subsurface ocean data from floats, data from ships, air- 
craft, and special intensive observing platforms. 

Much of this data is routinely collected and made available in col- 
lated form from National Weather Services or special projects such as 
AEROCOM (for aerosol data), International Comprehensive Ocean- 
Atmosphere Data Set (ICOADS) (ocean temperature and salinity from 
ships), Argo (ocean floats), etc. Multivariate data related to single experi- 
ments (e.g., the Atmospheric Radiation Measurement (ARM) program or 
the Surface Heat Budget of the Arctic (SHEBA)), are a little less well orga- 
nized, although usually available at specialized websites. 

This kind of data is useful for looking at coherent multivariate com- 
parisons, although usually on limited time and space domains, as input to 
weather model analyses or as the raw material for processed gridded data 
(see next subsection). The principal problems with this data are its
sparseness in space and time, inhomogeneities due to differing measurement
practices or instruments, and overall incompleteness (not all variables are
measured at the same time or place) [45, 62].

4.3.2 Gridded/Processed Observations 

Given a network of raw in-situ data, the next step is synthesizing those 
networks into quality-controlled, regularly gridded datasets. These have
a number of advantages over the raw data in that they are easier to work 
with, are more comparable to model output (discussed below), and have 
fewer nonclimatic artifacts. Gridded products are usually available on
5° latitude by 5° longitude grids or even higher resolution. However, these 
products use interpolation, gap-filling in space and time, and correc- 
tions for known biases, all of which affect the structural uncertainty in 
the product. The resulting error estimates are often dependent upon space 
and time. Different products targeting the same basic quantity can give 
some idea of the structural uncertainty in these products, and we strongly 
recommend using multiple versions. For example, for different estimates 
of the global mean surface temperature, anomalies can be found from the 
National Climatic Data Center (NCDC), the Hadley Centre, and NASA 
[6, 33, 90] that differ in processing and details but show a large amount of 
agreement at the large scale. 
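
As a rough illustration of the first gridding step only, the sketch below averages station values that fall in the same 5° latitude by 5° longitude box. The function names and the sample station values are hypothetical, and real gridded products add the interpolation, gap-filling, and bias corrections described above, none of which is attempted here.

```python
from collections import defaultdict


def grid_cell(lat, lon, size=5.0):
    """Return the (row, col) index of the size-degree box containing a point."""
    n_rows = int(180.0 / size)
    n_cols = int(360.0 / size)
    # Clamp so that lat = 90 and floating-point edge cases stay inside the grid.
    row = min(int((lat + 90.0) // size), n_rows - 1)
    col = min(int((lon % 360.0) // size), n_cols - 1)
    return row, col


def grid_average(stations, size=5.0):
    """Average (lat, lon, value) station records within each grid box.

    Boxes that contain no station are simply absent from the result;
    no gap-filling is attempted in this sketch.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lat, lon, value in stations:
        cell = grid_cell(lat, lon, size)
        sums[cell] += value
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Hypothetical temperature anomalies (degrees C) at three stations.
stations = [(42.3, -71.1, 0.4), (43.9, -72.5, 0.6), (-33.9, 151.2, -0.2)]
gridded = grid_average(stations)
```

The first two stations share a box and are averaged together, while the third occupies a box of its own. Running the same binning on two differently processed station archives would be one simple way to look at how much the products differ.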

4.3.3 Satellite Retrievals 

Since 1979, global and near-global observations of the Earth’s climate have 
been made from low-earth orbit and geostationary satellites. These obser- 
vations are based either on passive radiances (either emitted directly from 
the Earth, or via reflected solar radiation) or by active scanning via lasers or 
radars. These satellites are mainly operated by U.S. agencies (NOAA, NASA),
the European Space Agency, and the Japanese program (JAXA), and data
are generally available in near-real-time. There are a number of levels of
data, ranging from raw radiances (Level 1), through processed data as a
function of time (Level 2), to gridded averaged data at the global scale (Level 3).

Satellite products do have specific and particular views of the climate 
system, which requires that knowledge of the “satellite-eye” view be incor- 
porated into any comparison of satellite data with other products. Many 
satellite products are available for specific instruments on specific plat- 
forms; synthesis products across multiple instruments and multiple 
platforms are possible, but remain rare. 

4.3.4 Paleoclimate Proxies 

In-situ instrumental data only extends on a global basis to the mid-19th 
century, although individual records can extend to the 17th or 18th century. 
For a longer-term perspective, climate information must be extracted from
so-called “proxy” archives, such as ice cores, ocean mud, lake sediments, 
tree rings, pollen records, caves, or corals, which retain information that is 
sometimes highly correlated to specific climate variables or events [41]. 

As with satellite data, appropriate comparisons often require a for- 
ward model of the process by which climate information is stored and that 
incorporates the multiple variables that influence any particular proxy [75].
However, the often dramatically larger signals that can be found in past cli- 
mates can overcome the increase in uncertainty due to spatial sparseness 
and nonclimatic noise, especially when combined in a multi-proxy approach 
[58]. Problems in paleoclimate are discussed in further detail in Section 4.8. 

4.3.5 Reanalysis Products 

Weather forecast models use as much observational data (in-situ, remote
sensing, etc.) as can be assimilated in producing 6-hour forecasts (the
“analyses”), which are excellent estimates of the state of the climate at any 
one time. However, as models have improved over time, the time series of 
weather forecasts can contain trends related only to the change in model 
rather than changes in the real world. Thus, many of the weather forecast- 
ing groups have undertaken “reanalyses” that use a fixed model to reprocess 
data from the past in order to have a consistent view of the real world 
(see reanalyses.org for more details). This is somewhat equivalent to a physics- 
based interpolation of existing datasets and often provides the best estimate 
of the climate state over the instrumental period (e.g., ERA-Interim [16]). 

However, not all variables in the reanalyses are equally constrained 
by observational data. Thus, sea-level pressure and winds are well char- 
acterized, but precipitation, cloud fields, and surface fluxes are far more 
model dependent and thus are not as reliable. Additionally, there remain 
unphysical trends in the output as a function of changes in the observing 
network over time. In particular, the onset of large-scale remote sensing in 
1979 imparts jumps in many fields that can be confused with real climate 
trends [105]. 

4.3.6 Global Climate Model (GCM) Output 

Global climate models are physics-based simulations of the climate sys- 
tem, incorporating (optionally) components for the atmosphere, ocean, sea 
ice, land surface, vegetation, ice sheets, atmospheric aerosols and chem- 
istry, and carbon cycles. Simulations can either be transient in response 
to changing boundary conditions (such as hindcasts of the 20th century), 
or time slices for periods thought to be relatively stable (such as the mid- 
Holocene 6,000 years ago). Variations in output can depend on initial con- 
ditions (because of the chaotic nature of the weather), the model used, 
or variations in the forcing fields (due to uncertainties in the time his- 
tory, say, of aerosol emissions). A number of coordinated programs, nota- 
bly the Coupled Model Intercomparison Project (CMIP), have organized 
coherent model experiments that have been followed by multiple climate
modeling groups around the world and which are the dominant source for 
model output [96]. 

These models are used to define fingerprints of forced climate change 
that can be used in the detection and attribution of climate change [39], 
for hypothesis generation about linkages in the climate system, as test- 
beds for evaluating proposed real-world analyses [24], and, of course, 
future predictions [61]. Quantifying the structural uncertainty in model 
parameterizations or the model framework, the impact of known imper- 
fections in the realizations of key processes, and the necessity of compro- 
mises at small spatial or temporal scales are all important challenges. 

4.3.7 Regional Climate Model (RCM) Output 

Global models necessarily need to compromise on horizontal resolution. 
In order to incorporate more details at the local level (particularly regional 
topography), output from the global models or the global reanalyses can 
be used to drive a higher-resolution, regional climate model. The large- 
scale fields can then be transformed to higher resolution using physical 
principles embedded in the RCM code. In particular, rainfall patterns that 
are very sensitive to the detailed topography are often far better modeled 
within the RCM than in the global-scale driving model. However, there 
are many variables to consider in RCMs — from variations in how the 
boundary field forcing is implemented and in the physics packages — and 
the utility of using RCMs to improve predictions of change is not yet clear. 
A coordinated experiment to test these issues is the North American 
Regional Climate Change Assessment Program (NARCCAP) [60]. 

4.4 SCIENTIFIC PROBLEMS IN CLIMATE INFORMATICS 
There are a number of different kinds of problems that climate scientists 
are working on where machine learning and computer science techniques 
may make a big impact. This is a brief description of a few examples 
(with discussion of related work in the literature) that typify these ideas, 
although any specific implementation mentioned should not be consid- 
ered the last word. This section provides short descriptions of several chal- 
lenging problems in climate informatics broadly defined. In Section 4.5 
we present problems in climate data analysis. In subsequent sections we 
delve into more detail on some specific problems in climate informatics. 

4.4.1 Parameterization Development

Climate models need to deal with the physics that occurs at scales smaller 
than any finite model can resolve. This can involve cloud formation, tur- 
bulence in the ocean, land surface heterogeneity, ice floe interactions, 
chemistry on dust particle surfaces, etc. This is dealt with by using
parameterizations that attempt to capture the phenomenology of a spe- 
cific process and its sensitivity in terms of the (resolved) large scales. This 
is an ongoing task, and is currently driven mainly by scientists’ physical 
intuition and relatively limited calibration data. As observational data 
become more available, and direct numerical simulation of key pro- 
cesses becomes more tractable, there is an increase in the potential for 
machine learning and data mining techniques to help define new param- 
eterizations and frameworks. For example, neural network frameworks 
have been used to develop radiation models [50]. 

4.4.2 Using Multimodel Ensembles of Climate Projections 

There are multiple climate models that have been developed and are 
actively being improved at approximately 25 centers across the globe. 
Each model shares some basic features with at least some other models, 
but each has generally been designed and implemented independently 
and has many unique aspects. In coordinated Model Intercomparison 
Projects (MIPs) (most usefully, the Coupled MIP (CMIP3, CMIP5), the 
Atmospheric Chemistry and Climate MIP (ACCMIP), the PaleoClimate 
MIP (PMIP3), etc.), modeling groups have attempted to perform analogous 
simulations with similar boundary conditions but with multiple models. 
These “ensembles” offer the possibility of assessing what is robust across 
models, what are the roles of internal variability, structural uncertainty, 
and scenario uncertainty in assessing the different projections at different 
time and space scales, and multiple opportunities for model-observation
comparisons. Do there exist skill metrics for model simulations of the 
present and past that are informative for future projections? Are there 
weighting strategies that maximize predictive skill? How would one 
explore this? These are questions that also come up in weather forecasts, 
or seasonal forecasts, but are made more difficult for the climate problem 
because of the long timescales involved [40, 97]. Some recent work has 
applied machine learning to this problem with encouraging results [63]. 
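
To make the weighting question concrete, the following sketch weights each model by the inverse of its mean squared error against past observations and then combines the models' projections. This is only one hypothetical weighting scheme, chosen for simplicity; it is not the method of [63], and whether past skill is informative about future projections is exactly the open question raised above. All names and numbers are illustrative.

```python
def skill_weights(hindcasts, observations, eps=1e-12):
    """Return normalized weights, one per model, from inverse hindcast MSE.

    hindcasts: list of per-model series aligned with `observations`.
    eps guards against division by zero for a (hypothetical) perfect model.
    """
    inverse_mse = []
    for series in hindcasts:
        mse = sum((m - o) ** 2 for m, o in zip(series, observations)) / len(observations)
        inverse_mse.append(1.0 / (mse + eps))
    total = sum(inverse_mse)
    return [w / total for w in inverse_mse]


def weighted_projection(weights, projections):
    """Combine one projected value per model using the given weights."""
    return sum(w * p for w, p in zip(weights, projections))


# Two hypothetical models evaluated against three past observations.
observations = [1.1, 2.1, 3.1]
hindcasts = [[1.0, 2.0, 3.0],   # model A: small errors
             [2.0, 3.0, 4.0]]   # model B: large errors
weights = skill_weights(hindcasts, observations)
combined = weighted_projection(weights, [2.0, 4.0])
```

Here model A receives almost all of the weight, so the combined projection sits close to model A's value. An equal-weight mean is recovered by passing `[0.5, 0.5]` to `weighted_projection`, which makes the two strategies easy to compare.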

4.4.3 Paleoreconstructions

Understanding how climate varied in the past before the onset of wide- 
spread instrumentation is of great interest — not least because the climate 
changes seen in the paleo-record dwarf those seen in the 20th century and 
hence may provide much insight into the significant changes expected this 
century. Paleo-data is, however, even sparser than instrumental data and, 
moreover, is not usually directly commensurate with the instrumental 
record. As mentioned in Section 4.3, paleo-proxies (such as water isotopes,
tree rings, pollen counts, etc.) are indicators of climate change, but their
behavior is often subject to nonclimatic influences, and their relation to
more standard variables (such as temperature or precipitation) may be
nonstationary or convolved. There is an enormous
challenge in bringing together disparate, multi-proxy evidence to produce 
large-scale patterns of climate change [59], or from the other direction 
build enough “forward modeling” capability into the models to use the 
proxies directly as modeling targets [76]. This topic is discussed in further 
detail in Section 4.8. 

4.4.4 Data Assimilation and Initialized Decadal Predictions 

The primary way in which sparse observational data are used to construct 
complete fields is through data assimilation. This is a staple of weather fore- 
casts and various reanalyses in the atmosphere and ocean. In many ways, 
this is the most sophisticated use of the combination of models and obser- 
vations, but its use in improving climate predictions is still in its infancy. 
For weather timescales, this works well; but for longer term forecasts 
(seasons to decades), the key variables are in the ocean, not the atmosphere, 
and initializing a climate model so that the evolution of ocean variability 
models the real world in useful ways is very much a work in progress [44, 
90]. First results have been intriguing, if not convincing, and many more 
examples are slated to come online in the new CMIP5 archive [61]. 

4.4.5 Developing and Understanding Perturbed 
Physics Ensembles (PPEs) 

One measure of structural uncertainty in models is the spread among the 
different models from different modeling groups. But these models can- 
not be considered a random sample from the space of all possible models. 
Another approach is to take a single model and, within the code, vary mul- 
tiple (uncertain) parameters in order to generate a family of similar models 
that nonetheless sample a good deal of the intrinsic uncertainty that arises
in choosing any specific set of parameter values. These “Perturbed Physics
Ensembles” (PPEs) have been used successfully in the climateprediction.net
and Quantifying Uncertainty in Model Predictions (QUMP) projects
to generate controlled model ensembles that can be compared systematically
to observed data and used to make inferences [46, 64]. However, designing
such experiments and efficiently analyzing sometimes thousands of simulations
is a challenge, but one that will increasingly be attempted.

4.5 CLIMATE DATA ANALYSIS: PROBLEMS AND APPROACHES 

Here we discuss some additional challenge problems in analyzing climate 
data. The rate of data acquisition via satellite networks and reanalysis
projects is very rapid, and the amount of model output is growing equally
fast. Model-observation comparisons based on processes (i.e., the
multivariate changes that occur in a single event [or collection of events],
such as a North Atlantic storm, an ocean eddy, an ice floe melting event, 
a hurricane, a jet stream excursion, a stratospheric sudden warming, etc.) 
have the potential to provide very useful information on model credibility, 
physics, and new directions for parameterization improvements. However, 
data services usually deliver data in single-variable, spatially fixed, time- 
varying formats that make it very onerous to apply space and time filters 
to the collection of data to extract generic instances of the process in ques- 
tion. As a first step, algorithms for clustering data streams will be critical 
for clustering and detecting the patterns listed. There will also be the need 
to collaborate with systems and database researchers on the data chal- 
lenges mentioned here and in Section 4.11. Here we present several other 
problems to which cutting-edge data analysis and machine learning tech- 
niques are poised to contribute. 

4.5.1 Abrupt Changes 

Earth system processes form a nonlinear dynamical system and, as a result, 
changes in climate patterns can be abrupt at times [74]. Moreover, there 
is some evidence, particularly in glacial conditions, that climate tends to 
remain in relatively stable states for some period of time, interrupted by 
sporadic transitions (perhaps associated with so-called tipping points) that 
delineate different climate regimes. Understanding the causes behind sig- 
nificant abrupt changes in climate patterns can provide a deeper under- 
standing of the complex interactions between Earth system processes. The 
first step toward realizing this goal is to have the ability to detect and
identify abrupt changes from climate data.

Machine learning methods for detecting abrupt changes, such as 
extensive droughts that last for multiple years over a large region, should 
have the ability to detect changes with spatial and temporal persistence, 
and should be scalable to large datasets. Such methods should be able to 
detect well-known droughts such as the Sahel drought in Africa, the 1930s 
Dust Bowl in the United States, and droughts with similar characteristics 
where the climatic conditions were radically changed for a period of time 
over an extended region [23, 37, 78, 113]. A simple approach for detecting 
droughts is to apply a suitable threshold to a pertinent climate variable, 
such as precipitation or soil moisture content, and label low-precipitation 
regions as droughts. While such an approach will detect major events 
like the Sahel drought and the Dust Bowl, it will also detect isolated events,
such as low precipitation in one month for a single location that is clearly 
not an abrupt change event. Thus, the number of “false positives” from 
such a simple approach would be high, making subsequent study of each 
detected event difficult. 
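
A minimal sketch of the thresholding approach makes the false-positive problem visible. The precipitation values, the threshold, and the function names below are hypothetical and chosen only for illustration.

```python
def threshold_drought_mask(precip, threshold):
    """Flag every (time, location) entry whose precipitation is below threshold.

    precip[t][i] is the precipitation at time step t and location i.
    """
    return [[value < threshold for value in row] for row in precip]


def count_flags(mask):
    """Count the flagged entries in a boolean mask."""
    return sum(flag for row in mask for flag in row)


# Hypothetical monthly precipitation (mm) at three locations.
precip = [
    [80.0, 75.0, 12.0],  # month 0: one isolated dry month at location 2
    [15.0, 70.0, 85.0],  # months 1-3: persistent dry spell at location 0
    [10.0, 72.0, 90.0],
    [12.0, 68.0, 88.0],
]
mask = threshold_drought_mask(precip, threshold=20.0)
```

The mask flags the three-month dry spell at location 0, but it also flags the single dry month at location 2, because each entry is judged in isolation with no notion of spatial or temporal persistence.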

To identify drought regions that are spatially and temporally persistent, 
one can consider a discrete graphical model that ensures spatiotemporal 
smoothness of identified regions. Consider a discrete Markov Random 
Field (MRF) with a node corresponding to each location at each time 
step and a meaningful neighborhood structure that determines the edges 
in the underlying graph G = (V,E) [111]. Each node can be in one of two 
states: “normal” or “drought.” The maximum a posteriori (MAP) infer- 
ence problem in the MRF can be posed as 


$$x^* = \operatorname*{argmax}_{x \in \{0,1\}^N} \; \sum_{u \in V} \theta_u(x_u) + \sum_{(u,v) \in E} \theta_{uv}(x_u, x_v)$$

where $\theta_u$, $\theta_{uv}$ are node-wise and edge-wise potential functions that, respectively,
encourage agreement with actual observations and agreement
among neighbors; and $x_u$ is the state (i.e., “normal” or “drought”) at node
$u \in V$. The MAP inference problem is an integer programming problem
often solved using a suitable linear programming (LP) relaxation [70, 111].
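
The objective can be illustrated on a toy problem small enough to solve by exhaustive enumeration. The sketch below is not the LP relaxation used for the real analysis, and the potentials are assumptions made for illustration: the node potential rewards the "drought" state where a hypothetical evidence score exceeds 0.5, and the edge potential adds a bonus `beta` whenever two neighbors agree.

```python
from itertools import product


def map_labels(evidence, edges, beta):
    """Exact MAP labeling of a tiny binary MRF by brute-force enumeration.

    evidence[u] in [0, 1] is a hypothetical drought score at node u.
    Node potential: x_u * (2 * evidence[u] - 1), positive when drought is favored.
    Edge potential: beta if the two endpoint states agree, else 0.
    Enumeration costs 2**N evaluations, so this is only viable for very small N.
    """
    def score(x):
        node_term = sum(x[u] * (2.0 * evidence[u] - 1.0) for u in range(len(x)))
        edge_term = sum(beta for u, v in edges if x[u] == x[v])
        return node_term + edge_term

    return max(product((0, 1), repeat=len(evidence)), key=score)


# Five consecutive time steps at one location; step 2 has weak evidence.
evidence = [0.9, 0.9, 0.3, 0.9, 0.9]
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]

independent = map_labels(evidence, chain, beta=0.0)  # no smoothing
smoothed = map_labels(evidence, chain, beta=0.5)     # neighbors encouraged to agree
```

Without the edge term, the weak middle step is labeled "normal" and splits the event in two. With the edge term, the whole interval is labeled as one persistent drought, which is the spatiotemporal smoothing the MRF is meant to provide.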

Figure 4.1 shows results on drought detection over the past century
based on the MAP inference method. For the analysis, the Climatic Research
Unit (CRU) precipitation dataset was used at 0.5° × 0.5° latitude-longitude
spatial resolution from 1901 to 2006. The LP involved approximately
7 million variables and was solved using efficient optimization techniques.
The method detected almost all well-known droughts over the past century.
More generally, such a method can be used to detect and study
abrupt changes for a variety of settings, including heat waves, droughts,
precipitation, and vegetation. The analysis can be performed on observed
data, reanalysis data, as well as model outputs as appropriate.

FIGURE 4.1 (See color insert.) The drought regions detected by our
algorithm. Each panel shows the drought starting from a particular decade:
1905-1920 (top left), 1921-1930 (top right), 1941-1950 (bottom left), and
1961-1970 (bottom right). The regions in black rectangles indicate the common
droughts found by [63].

4.5.2 Climate Networks 

Identifying dependencies between various climate variables and climate
processes forms a key part of understanding the global climate system. Such
dependencies can be represented as climate networks [19, 20, 106, 107], where
relevant variables or processes are represented as nodes and dependencies 
are captured as edges between them. Climate networks are a rich represen- 
tation for the complex processes underlying the global climate system, and 
can be used to understand and explain observed phenomena [95, 108].

A key challenge in the context of climate networks is to construct such 
networks from observed climate variables. From a statistical machine 
learning perspective, the climate network should reflect suitable dependen- 
cies captured by the joint distribution of the variables involved. Existing 
methods usually focus on a suitable measure derived from the joint distri- 
bution, such as the covariance or the mutual information. From a sample- 
based estimate of the pairwise covariance or mutual information matrix, 
one obtains the climate network by suitably thresholding the estimated 
matrix. Such approaches have already shown great promise, often identify- 
ing some key dependencies in the global climate system [43] (Figure 4.2). 
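
The thresholding construction can be sketched in a few lines. The example below uses the sample Pearson correlation as the pairwise measure and keeps an edge whenever its absolute value reaches a chosen threshold; the series names, values, and threshold are hypothetical, and each series is assumed to be non-constant so that the correlation is defined.

```python
import math


def pearson(a, b):
    """Sample Pearson correlation of two equal-length, non-constant series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_network(series, threshold):
    """Return the set of edges (name_i, name_j) with |correlation| >= threshold.

    series maps a node name (e.g., a grid location) to its time series.
    """
    names = sorted(series)
    edges = set()
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(series[first], series[second])) >= threshold:
                edges.add((first, second))
    return edges


# Hypothetical anomaly series at three locations.
series = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [5.0, 4.0, 3.0, 2.0, 1.0],  # strongly anticorrelated with A
    "C": [2.0, 1.0, 4.0, 1.0, 3.0],  # only weakly related to A and B
}
edges = correlation_network(series, threshold=0.8)
```

Taking the absolute value keeps strongly anticorrelated pairs such as A and B, which is the kind of relationship a dipole represents. Only same-time correlation is used here, so lagged relationships would be missed.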

Going forward, there are a number of other computational and algorith- 
mic challenges that must be addressed to achieve more accurate representa- 
tions of the global climate system. For instance, current network construction 
methods do not account for the possibility of time-lagged correlations, yet 
we know that such relationships exist. Similarly, temporal autocorrelations 
and signals with varying amplitudes and phases are not explicitly handled. 
There is also a need for better balancing of the dominating signal of spatial 
autocorrelation with that of possible teleconnections (long-range dependen- 
cies across regions), which are often of high interest. In addition, there are 
many other processes that are well known and documented in the climate
science literature, and network representations should be able to incorporate
this a priori knowledge in a systematic manner. One of the initial
motivations and advantages of these network-based approaches is their
interpretability, and it will be critical that this property be retained as
these various aspects are integrated into increasingly complex models
and analysis methods.

FIGURE 4.2 (See color insert.) Climate dipoles discovered from sea-level pressure
(reanalysis) data using graph-based analysis methods (see [42] for details).

4.5.3 Predictive Modeling: Mean Processes and Extremes 
Predictive modeling of observed climatic phenomena can help in under- 
standing key factors affecting a certain observed behavior. While the usual 
goal of predictive modeling is to achieve high accuracy for the response 
variable, for example, the temperature or precipitation at a given location, 
in the context of climate data analysis, identifying the covariates having 
the most significant influence on the response is often more important. 
Thus, in addition to getting high predictive accuracy, feature selection will 
be a key focus of predictive modeling. Further, one needs to differentiate 
between mean processes and extremes, which are rather different regimes 
for the response variable. In practice, different covariates may be influenc- 
ing the response variable under different regimes and timescales. 

In recent literature, important advances have been made in doing fea- 
ture selection in the context of high-dimensional regression [66, 101]. For 
concreteness, consider the problem of predicting the mean temperature in 
Brazil based on multiple ocean variables over all ocean locations. While 
the number of covariates p runs into tens of thousands, the number of
samples n based on monthly means over a few decades is a few hundred
to a few thousand. Standard regression theory does not extend to this
n ≪ p scenario. Because the ocean variables at a particular location are
naturally grouped, only a few such locations are relevant for the predic- 
tion, and only a few variables in each such location are relevant, one can 
pose the regression problem as a sparse group lasso problem [24, 25]: 

$$\min_{\theta \in \mathbb{R}^{Nm}} \; \|y - X\theta\|_2^2 + \lambda_1 \|\theta\|_1 + \lambda_2 \sum_{g=1}^{N} \|\theta_g\|_2$$

where N is the number of ocean locations, m is the number of ocean variables
in each location so that p = Nm, $\theta$ is the weight vector over all covariates
to be estimated, $\theta_g$ is the set of weights over variables at location g,
and $\lambda_1$, $\lambda_2$ are nonnegative constants. The sparse group lasso regularizer
ensures that only a few locations get non-zero weights, and even among
these locations, only a few variables are selected. Figure 4.3 shows the
locations and features that were consistently selected for the task of temperature
prediction in Brazil.

FIGURE 4.3 Temperature prediction in Brazil: Variables chosen through
cross-validation.
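
As a rough illustration of how a sparse group lasso objective of the form ||y − Xθ||² + λ1||θ||_1 + λ2 Σ_g ||θ_g||_2 can be minimized, the following sketch uses a proximal-gradient iteration: a gradient step on the squared error, then element-wise soft thresholding, then group-wise shrinkage. The solver choice, the function names, the synthetic data, and the penalty values are all assumptions made for illustration; this is not the procedure used in [24, 25].

```python
import numpy as np

def sparse_group_lasso(X, y, n_groups, lam1, lam2, n_iter=500):
    """Proximal-gradient sketch (an assumed solver, chosen for brevity) for
        min ||y - X theta||_2^2 + lam1 * ||theta||_1 + lam2 * sum_g ||theta_g||_2,
    where the columns of X form n_groups contiguous groups of equal size."""
    n, p = X.shape
    m = p // n_groups
    # Step size 1/L, with L = 2 * sigma_max(X)^2 the Lipschitz constant of the gradient
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = 2.0 * (X.T @ (X @ theta - y))
        z = theta - step * grad
        # Element-wise soft threshold (l1 penalty)
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)
        # Group-wise shrinkage (group penalty)
        groups = z.reshape(n_groups, m)
        norms = np.linalg.norm(groups, axis=1, keepdims=True)
        shrink = np.maximum(1.0 - step * lam2 / np.maximum(norms, 1e-12), 0.0)
        theta = (groups * shrink).ravel()
    return theta

def objective(X, y, theta, n_groups, lam1, lam2):
    """Value of the penalized least-squares objective at theta."""
    group_norms = np.linalg.norm(theta.reshape(n_groups, -1), axis=1)
    return (np.sum((y - X @ theta) ** 2) + lam1 * np.abs(theta).sum()
            + lam2 * group_norms.sum())

# Synthetic stand-in: N = 30 "locations", m = 4 "variables", n = 60 samples (n < p)
rng = np.random.default_rng(0)
N, m, n = 30, 4, 60
X = rng.standard_normal((n, N * m))
theta_true = np.zeros(N * m)
theta_true[[8, 9, 40]] = [2.0, -1.5, 3.0]   # two active locations (groups 2 and 10)
y = X @ theta_true + 0.1 * rng.standard_normal(n)

theta_hat = sparse_group_lasso(X, y, n_groups=N, lam1=5.0, lam2=5.0)
group_norms = np.linalg.norm(theta_hat.reshape(N, m), axis=1)
print("locations with non-zero weights:", np.flatnonzero(group_norms > 0))
```

In practice the two penalty constants would be chosen by cross-validation, as the caption of Figure 4.3 indicates; the fixed values here are placeholders.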

4.6 SEASONAL CLIMATE FORECASTING 

Seasonal climate forecasts are those beyond the time frame of standard 
weather forecasts (e.g., 2 weeks) out to a season or two ahead (up to 
6 months). Fundamental questions concern what is (and is not) predictable 
and exactly how predictable it is. Addressing these questions also often 
gives a good indication of how to make a prediction in practice. These 
are difficult questions because much in the climate system is unpredict- 
able and the observational record is short. Methods from data mining and 
machine learning applied to observations and data from numerical cli- 
mate prediction models provide promising approaches. Key issues include 
finding components of the climate state-space that are predictable, and 
constructing useful associations between observations and corresponding 
predictions from numerical models. 

4.6.1 What Is the Basis for Seasonal Forecasting? 

The chaotic nature of the atmosphere and the associated sensitivity of
numerical weather forecasts to their initial conditions are described by
the well-known “butterfly effect,” the idea that the flap of a butterfly’s wings in
Brazil could set off a tornado in Texas. Small errors in the initial state
of a numerical weather forecast quickly amplify until the forecast has no
value. This sensitive dependence on initial conditions provides an expla- 
nation for the limited time horizon (a few days to a week) for which use- 
ful weather forecasts can be issued, and the belief until the early 1980s 
that seasonal forecasting was impossible [81]. This also explains why effort 
is needed to “find the needle of predictability in the haystack of chaos.” 
Given the limited predictability of weather, how is it that quantities such 
as precipitation and near-surface temperature are skillfully forecast sea- 
sons (3 to 6 months) in advance? 

First, it should be noted that the format of climate predictions is differ- 
ent from that of weather forecasts. Weather forecasts target the meteor- 
ological conditions of a particular day or hour. Climate predictions are 
made in terms of weather statistics over some time range. For instance, 
the most common quantities in current climate forecasts are 3-month
(seasonal) averages of precipitation and near-surface temperature. Two 
fundamental facts about the Earth system make climate forecasts pos- 
sible. First, the oceans evolve on timescales that are generally slower than 
those of the atmosphere, and some ocean structures are predictable sev- 
eral months in advance. The outstanding predictable ocean structure is 
associated with the El Niño-Southern Oscillation (ENSO) and is manifest
in the form of widespread, persistent departures (anomalies) of equatorial 
Pacific sea surface temperature (SST) from its seasonally adjusted long- 
term value. The first ENSO forecasts were made in the late 1980s [10]. The 
second fact is that some components of the atmosphere respond to per- 
sistent SST anomalies. The atmospheric response to SST on any given day 
tends to be small relative to the usual weather variability. However, because 
the SST forcing and the associated atmospheric response may persist for 
months or seasons, the response of a seasonal average to SST forcing may 
be significant [82]. For instance, ENSO has impacts on temperature, pre- 
cipitation, tropical cyclones, human health, and perhaps even conflict 
[31, 38, 49, 72]. Early seasonal forecasts constructed using canonical correlation
analysis (CCA) between antecedent SST and climate responses [3]
took advantage of this persistence of SST. Such statistical (or empirical, in
the sense of not including explicit fundamental physical laws) forecasts
remain attractive because of their generally low dimensionality and cost relative
to physical process-based models (typically general circulation models;
GCMs) with many millions of degrees of freedom.
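
The argument that a response too small to see on any given day can still be significant in a seasonal average can be made concrete with a short calculation. The sketch below assumes a constant forced signal and daily noise that follows a first-order autoregressive process; the signal size, the 90-day window, and the autocorrelation values are illustrative assumptions rather than estimates from data.

```python
import numpy as np

def noise_std_of_mean(sigma, n_days, rho=0.0):
    """Standard deviation of an n_days average of AR(1) noise with daily
    standard deviation sigma and lag-1 autocorrelation rho."""
    k = np.arange(1, n_days)
    # Variance inflation of the mean caused by serial correlation
    inflation = 1.0 + 2.0 * np.sum((1.0 - k / n_days) * rho ** k)
    return sigma * np.sqrt(inflation / n_days)

signal = 0.2      # assumed SST-forced response, in units of the daily noise std
sigma = 1.0       # daily weather noise std
for rho in (0.0, 0.5):
    snr_daily = signal / sigma
    snr_seasonal = signal / noise_std_of_mean(sigma, 90, rho)
    print(f"rho={rho}: daily SNR={snr_daily:.2f}, 90-day-mean SNR={snr_seasonal:.2f}")
```

With independent days the noise in a 90-day mean shrinks by a factor of √90 (about 9.5). Day-to-day persistence reduces that gain but does not remove it.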

4.6.2 Data Challenges

Here we introduce some challenges posed by the available data. Data chal- 
lenges are further discussed in Section 4.11. Serious constraints come from 
the dimensions of the available data. Reliable climate observations often 
do not extend more than 40 or 50 years into the past. This means that, for 
example, there may be only 40 or 50 available observations of January- 
March average precipitation. Moreover, the quality and completeness of 
that data may vary in time and space. Climate forecasts from GCMs often 
do not even cover this limited period. Many seasonal climate forecast 
systems started hindcasts in the early 1980s when satellite observations, 
particularly of SST, became available. In contrast to the sample size, the 
dimension of the GCM state-space may be of the order of 10^6, depending
on spatial grid resolution. Dimension reduction (principal component 
analysis [PCA] is commonly used) is necessary before applying classical 
methods like canonical correlation analysis to find associated features in 
predictions and observations [5]. There has been some use of more sophis- 
ticated dimensionality reduction methods in seasonal climate prediction 
problems [53]. Methods that can handle large state-spaces and small sam- 
ple size are needed. An intriguing recent approach that avoids the problem 
of small sample size is to estimate statistical models using long climate 
simulations unconstrained by observations and test the resulting model 
on observations [18, 115]. This approach has the challenge of selecting 
GCMs whose climate variability is “realistic,” which is a remarkably dif- 
ficult problem given the observational record. 
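
The dimension-reduction step mentioned above can be sketched as follows: a field with far more grid points than samples is projected onto a few leading principal components, and those low-dimensional scores are what a method such as CCA would then work with. The array sizes, the single synthetic mode, and the function name `pca_reduce` are assumptions made for illustration.

```python
import numpy as np

def pca_reduce(field, k):
    """Project a (n_samples, n_gridpoints) field onto its k leading principal
    components. Returns PC scores, spatial patterns, and variance fractions."""
    anomalies = field - field.mean(axis=0)
    u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
    var_frac = s ** 2 / np.sum(s ** 2)
    return u[:, :k] * s[:k], vt[:k], var_frac[:k]

# Synthetic stand-in for hindcasts: 30 seasons, 5000 grid points (n << p)
rng = np.random.default_rng(1)
n, p = 30, 5000
index = rng.standard_normal(n)                      # one large-scale mode
pattern = rng.standard_normal(p)
field = np.outer(index, pattern) + 0.5 * rng.standard_normal((n, p))

scores, patterns, var_frac = pca_reduce(field, k=3)
print(scores.shape, patterns.shape, var_frac.round(2))
```

Because only n samples are available, at most n − 1 components carry any variance after centering, so the choice of k is itself a small-sample estimation problem.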

4.6.3 Identifying Predictable Quantities 

The initial success of climate forecasting has been in the prediction of sea- 
sonal averages of quantities such as precipitation and near-surface tem- 
perature. In this case, time averaging serves as a filter with which to find 
predictable signals. A spatial average of SST in a region of the equatorial 
Pacific is used to define the NINO3.4 index, which is used in ENSO forecasts
and observational analysis. This spatial average serves to enhance
the large-scale predictable ENSO signal by reducing noise. The Madden- 
Julian Oscillation (MJO) is a sub-seasonal component of climate variabil- 
ity that is detected using time and space filtering. There has been some 
work on constructing spatial filters that were designed to optimize mea- 
sures of predictability [17] and there are opportunities for new methods 
that incorporate optimal time and space filtering and that optimize more
general measures of predictability. 
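
A spatial-average index of the kind described above can be computed in a few lines. The sketch assumes an SST anomaly array ordered as (time, latitude, longitude), longitudes in degrees east, and the commonly used NINO3.4 box of 5°S to 5°N and 170°W to 120°W; the grid, the synthetic anomaly, and the function name `box_average` are illustrative assumptions. Operational indices also involve choices (base period, smoothing) that are not shown.

```python
import numpy as np

def box_average(sst, lat, lon, lat_bounds=(-5.0, 5.0), lon_bounds=(190.0, 240.0)):
    """Area-weighted mean of sst (time, lat, lon) over a latitude-longitude box.
    Defaults are the commonly used NINO3.4 box, 5S-5N and 170W-120W,
    with longitudes given in degrees east (0-360)."""
    lat_in = (lat >= lat_bounds[0]) & (lat <= lat_bounds[1])
    lon_in = (lon >= lon_bounds[0]) & (lon <= lon_bounds[1])
    box = sst[:, lat_in][:, :, lon_in]
    weights = np.cos(np.deg2rad(lat[lat_in]))       # grid-cell area ~ cos(latitude)
    return (box.mean(axis=2) * weights).sum(axis=1) / weights.sum()

# Synthetic example on a 2.5-degree grid: a warm anomaly confined to the box
lat = np.arange(-88.75, 90.0, 2.5)
lon = np.arange(1.25, 360.0, 2.5)
sst_anom = np.zeros((12, lat.size, lon.size))
lat_mask = (lat >= -5.0) & (lat <= 5.0)
lon_mask = (lon >= 190.0) & (lon <= 240.0)
sst_anom[:, lat_mask[:, None] & lon_mask[None, :]] = 1.5

index = box_average(sst_anom, lat, lon)
print(index)
```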

While predicting the weather of an individual day is not possible in a 
seasonal forecast, it may be possible to forecast statistics of weather such as 
the frequency of dry days or the frequency of consecutive dry days. These 
quantities are often more important to agriculture than seasonal totals. 
Drought has a complex time-space structure that depends on multiple 
meteorological variables. Data mining and machine learning (DM/ML) 
methods can be applied to observations and forecasts to identify drought, 
as discussed in Section 4.5. 
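
The weather statistics mentioned above are simple to compute from a daily series. The sketch below assumes a wet-day threshold of 1 mm/day, which is a common but not universal convention, and uses made-up daily totals.

```python
import numpy as np

def dry_day_stats(daily_precip, wet_threshold=1.0):
    """Return (fraction of dry days, longest run of consecutive dry days).
    A day is 'dry' if precipitation is below wet_threshold (mm/day, assumed)."""
    dry = np.asarray(daily_precip) < wet_threshold
    longest = current = 0
    for is_dry in dry:
        current = current + 1 if is_dry else 0
        longest = max(longest, current)
    return dry.mean(), longest

season = [0.0, 0.2, 5.1, 0.0, 0.0, 0.4, 12.3, 0.0]   # made-up daily totals, mm
freq, spell = dry_day_stats(season)
print(freq, spell)   # 0.75 3
```

Two seasons with the same total rainfall can differ sharply in these statistics, which is why they may matter more to agriculture than the seasonal total.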

Identification of previously unknown predictable climate features may 
benefit from the use of DM/ML methods. Cluster analysis of tropical 
cyclone tracks has been used to identify features that are associated with 
ENSO and MJO variability [9]. Graphical models, the nonhomogeneous 
Hidden Markov Model in particular, have been used to obtain stochastic 
daily sequences of rainfall conditioned on GCM seasonal forecasts [32]. 

The time and space resolution of GCM forecasts limits the physi- 
cal phenomena they can resolve. However, they may be able to predict 
proxies or large-scale antecedents of relevant phenomena. For instance, 
GCMs that do not resolve tropical cyclones (TCs) completely do form TC- 
like structures that can be used to make TC seasonal forecasts [8, 110]. 
Identifying and associating GCM “proxies” with observed phenomena is
also a DM/ML problem.

Regression methods are used to connect climate quantities to associ- 
ated variables that are either unresolved by GCMs or not even climate 
variables. For instance, Poisson regression is used to relate large-scale cli- 
mate quantities with hurricanes [104], and generalized additive models 
are used to relate heat waves with increased mortality [68]. Again, the 
length of the observational record makes this challenging. 
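
As an illustration of the Poisson regression mentioned above, the sketch below fits a log-linear model for seasonal storm counts as a function of one large-scale climate index, using Newton's method. The data are synthetic, the coefficients are made up, and the function name `poisson_regression` is an assumption; it is not the model of [104].

```python
import numpy as np

def poisson_regression(X, y, n_iter=25):
    """Fit log E[y] = X @ beta by Newton's method (equivalently IRLS).
    X must already contain a column of ones for the intercept in column 0."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())            # start from the mean rate
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)             # score of the Poisson log-likelihood
        hess = X.T @ (X * mu[:, None])    # Fisher information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Synthetic stand-in: 60 seasons, one standardized climate index, made-up coefficients
rng = np.random.default_rng(2)
index = rng.standard_normal(60)
X = np.column_stack([np.ones(60), index])
counts = rng.poisson(np.exp(1.0 + 0.5 * index))

beta_hat = poisson_regression(X, counts)
print("intercept, slope:", beta_hat.round(2))
```

With only a few dozen seasons the slope estimate carries substantial sampling uncertainty, which is the practical form of the record-length problem noted above.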

4.6.4 Making the Best Use of GCM Data 

Data from multiple GCM climate forecasts are routinely available. 
However, converting that data into a useful forecast product is a nontrivial 
task. GCMs have systematic errors that can be identified (and potentially 
corrected) through regression-like procedures with observations. Robust 
estimates of uncertainty are needed to construct probabilistic forecasts. 
Because forecasts are available from multiple GCMs, another question is 
how best to combine information from multiple sources, given the rela- 
tively short observation records with which to estimate model performance. 
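
One simple version of the regression-like correction and combination described above is sketched here: each model's hindcasts are linearly corrected against observations under leave-one-out cross-validation, and the corrected forecasts are then averaged with equal weights. The synthetic models, the equal weighting, and the function names are assumptions made for illustration; estimating unequal weights reliably from roughly 30 years is exactly the difficulty the text raises.

```python
import numpy as np

def loo_linear_correction(forecast, obs):
    """Leave-one-out cross-validated linear correction of one model's hindcasts.
    For each year, slope and intercept are fit on all other years."""
    forecast, obs = np.asarray(forecast, float), np.asarray(obs, float)
    corrected = np.empty_like(obs)
    for i in range(obs.size):
        keep = np.arange(obs.size) != i
        slope, intercept = np.polyfit(forecast[keep], obs[keep], 1)
        corrected[i] = slope * forecast[i] + intercept
    return corrected

def rmse(forecast, target):
    """Root-mean-square error of a forecast series."""
    return np.sqrt(np.mean((forecast - target) ** 2))

# Synthetic stand-in: 30 years of observations and two biased, noisy model hindcasts
rng = np.random.default_rng(3)
obs = rng.standard_normal(30)
model_a = 0.5 * obs + 2.0 + 0.5 * rng.standard_normal(30)   # damped, warm-biased
model_b = 1.5 * obs - 1.0 + 0.8 * rng.standard_normal(30)   # amplified, cold-biased

corrected = np.array([loo_linear_correction(m, obs) for m in (model_a, model_b)])
multi_model = corrected.mean(axis=0)     # equal weights, an assumed baseline
print("raw RMSE:", rmse(model_a, obs).round(2), rmse(model_b, obs).round(2),
      "| combined, corrected RMSE:", rmse(multi_model, obs).round(2))
```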

4.7 CLIMATE EXTREMES, UNCERTAINTY, AND IMPACTS

4.7.1 The Climate Change Challenge 

The Fourth Assessment Report of the Intergovernmental Panel on 
Climate Change (IPCC, AR4) has resulted in wider acceptance of global 
climate change caused by anthropogenic drivers of emission scenarios. 
However, earth system modelers struggle to develop precise predictions 
of extreme events (e.g., heat waves, cold spells, extreme rainfall events, 
droughts, hurricanes, and tropical storms) or extreme stresses (e.g., tropi- 
cal climate in temperate regions or shifting rainfall patterns) at regional 
and decadal scales. In addition, the most significant knowledge gap rel- 
evant for policy makers and stakeholders remains the inability to produce 
credible estimates of local-to-regional scale climate extremes and change 
impacts. Uncertainties in process studies, climate models, and associated 
spatiotemporal downscaling strategies may be assessed and reduced by 
statistical evaluations. But a similar treatment for extreme hydrological 
and meteorological events may require novel statistical approaches and 
improved downscaling. Scenario uncertainty for climate change impacts 
is fundamentally intractable, but other sources of uncertainty may be 
amenable to reduction. Regional impacts need to account for additional 
uncertainties in the estimates of anticipatory risks and damages, whether 
on the environment, infrastructures, economy, or society. The cascading 
uncertainties from scenarios, to models, to downscaling, and finally to 
impacts, make costly decisions difficult to assess. This problem grows acute 
if credible attributions must be made to causal drivers or policy impacts. 

4.7.2 The Science of Climate Extremes 

One goal is to develop quantifiable insights on the impacts of global cli- 
mate change on weather or hydrological extreme stresses and extreme 
events at regional to decadal scales. Precise and local predictions, for 
example, the likelihood of an extreme event on a given day of any given 
year a decade later, will never be achievable, owing to the chaotic nature 
of the climate system as well as the limits to precision of measurements 
and our inability to model all aspects of the process physics. However, 
probability density functions of the weather and hydrology, for example, 
likelihoods of intensity-duration-frequency (IDF) of extreme events or of 
mean change leading to extreme stress, may be achievable targets. The 
tools of choice range from the two traditional pillars of science: theory 
(e.g., advances in physical understanding and high-resolution process 
models of atmospheric or oceanic climate, weather, or hydrology) to
experimentation (e.g., development of remote and in-situ sensor systems 
as well as related cyber-infrastructures to monitor the Earth and environ- 
mental systems). However, perhaps the most significant breakthroughs 
are expected from the relatively new pillars: computational sciences and 
informatics. Research in the computational sciences for climate extremes 
science includes the computational data sciences (e.g., high-performance
analytics based on extreme value theory and nonlinear data sciences to 
develop predictive insights based on a combination of observations and 
climate model simulations) and computational modeling (e.g., regional 
scale climate models, models of hydrology, improvements in high- 
resolution processes within general circulation models, as well as feed- 
back to model development based on comparisons of simulations with 
observations), while the informatics aspects include data management 
and discovery (e.g., development of methodologies for geographic data 
integration and management, knowledge discovery from sensor data, and 
geospatial-temporal uncertainty quantification). 
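
To give a flavor of the extreme value analyses referred to above, the sketch below fits a Gumbel distribution to annual maxima by the method of moments and computes return levels. This is a deliberate simplification: the Gumbel is only the zero-shape special case of the generalized extreme value family, practical studies usually estimate the shape parameter and use likelihood-based fits, and the data here are synthetic.

```python
import numpy as np

def fit_gumbel_moments(annual_maxima):
    """Method-of-moments Gumbel fit; returns (location, scale)."""
    x = np.asarray(annual_maxima, float)
    scale = x.std(ddof=1) * np.sqrt(6.0) / np.pi
    loc = x.mean() - np.euler_gamma * scale
    return loc, scale

def return_level(loc, scale, return_period):
    """Value exceeded on average once every return_period years."""
    return loc - scale * np.log(-np.log(1.0 - 1.0 / return_period))

# Synthetic stand-in for 50 years of annual maximum daily rainfall (mm)
rng = np.random.default_rng(4)
annual_max = rng.gumbel(loc=60.0, scale=15.0, size=50)

loc, scale = fit_gumbel_moments(annual_max)
for T in (10, 50, 100):
    print(f"{T}-year return level: {return_level(loc, scale, T):.1f} mm")
```

A 100-year return level estimated from 50 years of data is an extrapolation, so any practical use would need confidence intervals alongside the point estimate.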

4.7.3 The Science of Climate Impacts 

The study of climate extremes is inextricably linked to the study of 
impacts, including risks and damage assessments as well as adaptation 
and mitigation strategies. Thus, an abnormally hot summer or high occurrence
of hurricanes in unpopulated or remote regions of the world, which
does not otherwise affect resources or infrastructures, has little or no climate
impact on society. On the other hand, extreme events such as the
aftereffects of Hurricane Katrina have extreme impacts owing to complex 
interactions among multiple effects: a large hurricane hitting an urban 
area, an already vulnerable levee breaking down because of the flood 
waters, as well as an impacted society and response systems that are nei- 
ther robust nor resilient to shocks. In general, climate change mitigation 
(e.g., emission policies and regulations to possible weather modification 
and geoengineering strategies) and adaptation (e.g., hazards and disaster 
preparedness, early warning and humanitarian assistance or the manage- 
ment of natural water, nutritional and other resources, as well as possible 
migration and changes in regional population growth or demographics) 
must be based on actionable predictive insights that consider the inter- 
action of climate extremes science with the critical infrastructures and 
key resources, population, and society. While the science of impacts can 
be challenging and relatively difficult to quantify, given recent advances 
in machine learning, geospatial modeling, data fusion, and Geographic
Information Systems (GIS), this is a fertile area for progress on cli- 
mate informatics. 

4.8 RECONSTRUCTING PAST CLIMATE 

The most comprehensive observations of Earth’s climate span only the 
past one to two hundred years [105]. This time period includes the estab- 
lishment of long-term and widespread meteorological stations across the 
continental landmasses [6], ocean observing networks from ships and 
buoys [114] and, within the more recent past, remote sensing from satel- 
lites [109]. Much of our understanding about the climate system and con- 
temporary climate change comes from these and related observations and 
their fundamental role in evaluating theories and models of the climate 
system. Despite the valuable collection of modern observations, however, 
two factors limit their use as a description of the Earth’s climate and its 
variability: (1) relative to known timescales of climate variability, they 
span a brief period of time; and (2) much of the modern observational 
interval is during an emergent and anomalous climate response to anthro- 
pogenic emissions of greenhouse gases [36]. Both of these factors limit 
assessments of climate variability on multi-decadal and longer times- 
cales, or characterizations of climatic mean states under different forcing* 
scenarios (e.g., orbital configurations or greenhouse gas concentrations). 
Efforts to estimate climate variability and mean states prior to the instru- 
mental period are thus necessary to fully characterize how the climate can 
change and how it might evolve in the future in response to increasing 
greenhouse gas emissions. 

Paleoclimatology is the study of Earth’s climate history and offers 
estimates of climate variability and change over a range of timescales 
and climate regimes. Among the many time periods of relevance, the 
Common Era (CE; the past two millennia) is an important target because 
the abundance of high-resolution paleoclimatic proxies (e.g., tree rings, 
ice cores, cave deposits, corals, and lake sediments) over this time interval 
allows seasonal-to-annual reconstructions on regional-to-global spatial
scales [40]. The CE also spans the rise and fall of many human civilizations,
making paleoclimatic information during this time period important
for understanding the complicated relationships between climate and
organized societies [7, 15].

* A “forcing” is a specific driver of climate change, external to the climate models; for instance,
changes in the composition of well-mixed greenhouse gases (e.g., CO2 or CH4), changes in the
land surface due to deforestation or urbanization, changes in air pollution, changes in the sun’s
input, or the impact of large volcanic eruptions. Each forcing can be usefully characterized by the
impact it has on the radiative balance at the top of the atmosphere: positive forcings increase
the energy coming into the climate system and hence warm the planet, while negative forcings
cool the climate.

Given the broad utility and vast number of proxy systems that are 
involved, the study of CE climate is a wide-ranging and diverse enterprise. 
The following discussion is not meant to survey this field
as a whole, but instead to focus on a relatively recent pursuit in CE paleo- 
climatology that seeks to reconstruct global or hemispheric temperatures 
using syntheses of globally distributed multi-proxy networks. This par- 
ticular problem is one that may lend itself well to new and emerging data 
analysis techniques, including machine learning and data mining meth- 
ods. The motivation of the following discussion therefore is to outline the 
basic reconstruction problem and describe how methods are tested in syn- 
thetic experiments. 

4.8.1 The Global Temperature Reconstruction Problem 
It is common to separate global or hemispheric (large-scale) tempera- 
ture reconstruction methods into two categories. The first involves index 
methods that target large-scale indices such as hemispheric mean tem- 
peratures [13, 35, 51, 58]; the second comprises climate field reconstruc- 
tion (CFR) methods that target large-scale patterns, that is, global maps of 
temperature change [21, 55, 56, 59, 88]. Although both of these approaches 
often share common methodological foundations, the following discus- 
sion focuses principally on the CFR problem. 

Large-scale temperature CFRs rely on two primary data sets. The first is 
monthly or annual gridded (5° latitude x 5° longitude) temperature prod- 
ucts that have near-global coverage beginning in the mid-to-late 19th cen- 
tury. These gridded temperature fields have been derived from analyses 
of land- and sea-based surface temperature measurements from meteorological
stations, and ship- and buoy-based observing networks [6, 42]. The
second data set comprises collections of multiple climate proxy archives
[58], each of which has been independently analyzed to establish its sensitivity
to some aspect of local or regional climate variability. These proxy
records are distributed heterogeneously about the globe (Figure 4.4), 
span variable periods of time, and each is subject to proxy-specific errors 
and uncertainties. 

FIGURE 4.4 (a) Representation of the global distribution of the most up-to-date
global multi-proxy network used by Mann et al. [58]. Grey squares indicate the 5°
grid cells that contain at least one proxy in the unscreened network from ref. [58].
(b) Schematic of the data matrix for temperature field reconstructions spanning
all or part of the CE. Grey regions in the data matrix are schematic representations
of data availability in the instrumental temperature field and the multi-proxy
matrix. White regions indicate missing data in the various sections of the
data matrix.

The basic premise of CFR techniques is that a relationship can be determined
between observed climate fields and multi-proxy networks during
their common interval of overlap. Once defined, this relationship can be
used to estimate the climate fields prior to their direct measurement using 
the multi-proxy network that extends further into the past. Figure 4.4 
represents this concept schematically using a data matrix that casts the 
CFR formalism as a missing data problem. Note that this missing data 
approach was originally proposed for CFRs using regularized expectation 
maximization [77], and has since become a common method for recon- 
structions targeting the CE [56, 57, 59]. The time-by-space data matrix 
in Figure 4.4 is constructed first from the instrumental data, with rows 
corresponding to years and columns corresponding to the number of grid 
cells in the instrumental field. For a typical CFR targeting an annual and 
global 5° x 5° temperature field, the time dimension is several centuries to 
multiple millennia, and the space dimension is on the order of one to two 
thousand grid cells. The time dimension of the data matrix is determined 
by the length of the calibration interval during which time the temper- 
ature observations are available, plus the reconstruction interval that is 
determined by the length of available proxy records. The number of spatial
locations may be less than the 2,592 possible grid cells in a 5° global grid, 
and depends on the employed surface temperature analysis product. A 
reconstruction method may seek to infill grid cells that are missing tem- 
perature observations [103], or simply leave them missing, depending on 
the number of years that they span [59]. The second part of the composite 
data matrix is formed from the multi-proxy network, the dimensions of 
which are determined by the longest proxy records and the total number 
of proxies (typically on the order of a few hundred to a thousand). The 
number of records in multi-proxy networks typically decreases back in 
time, and may reduce to a few tens of records in the earliest period of the 
reconstruction interval. The temporal resolution of the proxy series may 
also vary from seasonal to decadal. 

Multiple methods have been used for CFRs, including a number of new 
and emerging techniques within Bayesian frameworks [52, 103]. The vast 
majority of CFRs to date, however, have applied forms of regularized, multi- 
variate linear regression, in which a linear regression operator is estimated 
during a period of overlap between the temperature and proxy matrices. 
Such linear regression approaches work best when the time dimension in 
the calibration interval (Figure 4.4) is much larger than the spatial dimen- 
sion, because the covariance between the temperature field and the prox- 
ies is more reliably estimated. The challenge for CFR methods involves 
the manner in which the linear regression operator is estimated in practi- 
cal situations when this condition is not met. It is often the case in CFR 
applications that the number of target variables exceeds the time dimension,
yielding a rank-deficient problem. The linear regression formalism
therefore requires some form of regularization. Published linear methods 
for global temperature CFRs vary primarily in their adopted form of reg- 
ularization (see [88, 102] for general discussions on the methodological 
formalism). Matrix factorizations such as Singular Value Decomposition 
[29] of the temperature and proxy matrices are common first steps. If the 
squared singular values decrease quickly, as is often the case in climato- 
logical data where leading climate patterns dominate over many more 
weakly expressed local patterns or noise, reduced-rank representations of 
the temperature and proxy matrices are typically good approximations of 
the full-rank versions of the matrices. These reduced-rank temperature 
and proxy matrices therefore are used to estimate a linear regression oper- 
ator during the calibration interval using various multivariate regression 
techniques. Depending on the method used, this regression operator may 
be further regularized based on analyses of the cross-covariance or correlation
of the reduced temperature and proxy matrices. Multiple means
of selecting rank reductions at each of these steps have been pursued, such 
as selection rules based on analyses of the singular value (or eigenvalue) 
spectrum [57] or minimization of cross-validation statistics calculated for 
the full range of possible rank-reduction combinations [88].
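
The general pattern of reduced-rank regression described in this section can be sketched in a few lines: truncate the SVD of the calibration-period temperature and proxy matrices, regress the retained temperature components on the retained proxy components, and apply that operator to proxies from the reconstruction interval. This is a generic illustration, not any specific published CFR method; the truncation ranks, the synthetic data, and the function name `reduced_rank_cfr` are assumptions.

```python
import numpy as np

def reduced_rank_cfr(T_cal, P_cal, P_rec, k_t, k_p):
    """Truncated-SVD regression sketch.
    T_cal: (n_cal, n_grid) temperatures; P_cal: (n_cal, n_proxy) proxies in the
    calibration interval; P_rec: (n_rec, n_proxy) proxies before it."""
    t_mean, p_mean = T_cal.mean(axis=0), P_cal.mean(axis=0)
    ut, st, vt_t = np.linalg.svd(T_cal - t_mean, full_matrices=False)
    up, sp, vt_p = np.linalg.svd(P_cal - p_mean, full_matrices=False)
    t_scores = ut[:, :k_t] * st[:k_t]          # leading temperature PCs
    p_scores = up[:, :k_p] * sp[:k_p]          # leading proxy PCs
    # Regression operator between the reduced-rank representations
    coef, *_ = np.linalg.lstsq(p_scores, t_scores, rcond=None)
    p_rec_scores = (P_rec - p_mean) @ vt_p[:k_p].T
    return p_rec_scores @ coef @ vt_t[:k_t] + t_mean

# Synthetic stand-in: 500 years, 200 grid cells, 40 proxies, 100-year calibration
rng = np.random.default_rng(5)
n_years, n_grid, n_proxy, n_cal = 500, 200, 40, 100
modes = rng.standard_normal((n_years, 3))
T = modes @ rng.standard_normal((3, n_grid)) + 0.2 * rng.standard_normal((n_years, n_grid))
P = modes @ rng.standard_normal((3, n_proxy)) + 1.0 * rng.standard_normal((n_years, n_proxy))

T_hat = reduced_rank_cfr(T[-n_cal:], P[-n_cal:], P[:-n_cal], k_t=3, k_p=3)
skill = np.corrcoef(T_hat.mean(axis=1), T[:-n_cal].mean(axis=1))[0, 1]
print("correlation of reconstructed and true mean temperature:", round(skill, 2))
```

Here the ranks are fixed by hand. The selection rules discussed above, based on the singular value spectrum or on cross-validation, would replace that choice in a real application.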

4.8.2 Pseudoproxy Experiments 

The literature is replete with discussions of the variously applied CFR 
methods and their performance (see [29] for a cogent summary of many 
employed methods). Given this large number of proposed approaches, it 
has become important to establish means of comparing methods using 
common datasets. An emerging tool for such comparisons is millennium- 
length, forced transient simulations from Coupled General Circulation 
Models (CGCMs) [1, 30]. These model simulations have been used as syn- 
thetic climates in which to evaluate the performance of reconstruction 
methods in tests that have been termed pseudoproxy experiments (see 
[85] for a review). The motivation for pseudoproxy experiments is to adopt 
a common framework that can be systematically altered and evaluated. 
They also provide a much longer, albeit synthetic, validation period than 
can be achieved with real-world data, and thus methodological evalua- 
tions can extend to lower frequencies and longer timescales. Although one 
must always be mindful of how the results translate into real-world impli- 
cations, these design attributes allow researchers to test reconstruction 
techniques beyond what was previously possible and to compare multiple 
methods on common datasets. 

The basic approach of a pseudoproxy experiment is to extract a por- 
tion of a spatiotemporally complete CGCM field in a way that mimics the 
available proxy and instrumental data used in real-world reconstructions. 
The principal experimental steps proceed as follows: (1) pseudoinstru- 
mental and pseudoproxy data are subsampled from the complete CGCM 
field from locations and over temporal periods that approximate their 
real-world data availability; (2) the time series that represent proxy infor- 
mation are added to noise series to simulate the temporal (and in some 
cases spatial) noise characteristics that are present in real-world proxy 
networks; and (3) reconstruction algorithms are applied to the model- 
sampled pseudo-instrumental data and pseudoproxy network to produce 
a reconstruction of the climate simulated by the CGCM. The culminating 
fourth step is to compare the derived reconstruction to the known model 
target as a means of evaluating the skill of the applied method and the
uncertainties expected to accompany a real-world reconstruction product. 
Multi-method comparisons can also be undertaken from this point. 
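
The four experimental steps can be sketched end to end on synthetic data. In the example below a random low-rank field stands in for CGCM output, white noise is added to the sampled series at a chosen signal-to-noise ratio, and ridge regression stands in for the reconstruction algorithm; all of these, along with the proxy locations and function names, are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in for a complete CGCM field: 1000 years x 60 grid cells (synthetic, not model output)
n_years, n_grid, n_cal = 1000, 60, 150
modes = rng.standard_normal((n_years, 2))
field = modes @ rng.standard_normal((2, n_grid)) + 0.3 * rng.standard_normal((n_years, n_grid))

def make_pseudoproxies(field, locations, snr, rng):
    """Steps 1-2: sample the field at proxy locations and add white noise
    scaled to a target signal-to-noise ratio (by standard deviation)."""
    signal = field[:, locations]
    return signal + (signal.std(axis=0) / snr) * rng.standard_normal(signal.shape)

def ridge_reconstruct(P_cal, T_cal, P_rec, alpha=1.0):
    """Step 3: ridge regression of the field on the proxies (an assumed,
    deliberately simple reconstruction method)."""
    p_mean, t_mean = P_cal.mean(axis=0), T_cal.mean(axis=0)
    Pc = P_cal - p_mean
    coef = np.linalg.solve(Pc.T @ Pc + alpha * np.eye(Pc.shape[1]), Pc.T @ (T_cal - t_mean))
    return (P_rec - p_mean) @ coef + t_mean

locations = rng.choice(n_grid, size=15, replace=False)   # assumed proxy sites
proxies = make_pseudoproxies(field, locations, snr=1.0, rng=rng)

# Pseudo-instrumental data exist only for the last n_cal years
recon = ridge_reconstruct(proxies[-n_cal:], field[-n_cal:], proxies[:-n_cal])

# Step 4: compare the reconstruction with the known target
target = field[:-n_cal]
rmse = np.sqrt(np.mean((recon - target) ** 2))
print("reconstruction RMSE:", rmse.round(2), "| target std:", target.std().round(2))
```

Repeating the experiment while varying the signal-to-noise ratio, the number of proxies, or the reconstruction function is what makes the framework useful for systematic method comparison.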

Multiple datasets are publicly available for pseudoproxy experiments 
through supplemental Websites of published papers [57, 87, 89, 103].
The Paleoclimate Reconstruction Challenge is also a newly established
online portal through the Paleoclimatology Division of the National
Oceanic and Atmospheric Administration that provides additional
pseudoproxy datasets.* This collection of common datasets is an important 
resource for researchers wishing to propose new methodological applica- 
tions for CFRs, and is an excellent starting point for these investigations. 

4.8.3 Climate Reconstructions and the Future 

More than a decade of research on deriving large-scale temperature recon- 
structions of the CE has yielded many insights into our past climate and 
established the utility of such efforts as a guide to the future. Important 
CFR improvements are nevertheless still necessary and leave open the 
potential for new analysis methods to have significant impacts on the 
field. Broad assessments of the multivariate linear regression framework 
have shown the potential for variance losses and mean biases in recon- 
structions on hemispheric scales [13, 51, 86], although some methods have 
demonstrated significant skill for reconstructions of hemispheric and 
global indices [57]. The spatial skill of CFRs, however, has been shown 
in pseudoproxy experiments to vary widely, with some regions showing 
significant errors [89]. Establishing methods with improved spatial skill 
is therefore an important target for alternative CFR approaches. It also 
is critical to establish rigorous uncertainty estimates for derived recon- 
structions by incorporating a more comprehensive characterization of 
known errors into the reconstruction problem. Bayesian and ensemble 
approaches lend themselves well to this task and constitute another open 
area of pursuit for new methodological applications. Process-based char- 
acterizations of the connection between climate and proxy responses also 
are becoming more widely established [2, 22, 76, 100]. These developments 
make it possible to incorporate physically based forward models as con- 
straints on CFR problems and further open the possibility of methodolog- 
ical advancement. Recent Bayesian studies have provided the groundwork 
for such approaches [52, 103], while paleoclimatic assimilation techniques
have also shown promise [112]. 

In the context of machine learning, the problem of reconstructing parts 
of a missing data matrix has been widely studied as the matrix completion 
problem (see Figure 4.4). A popular example of the problem is encoun- 
tered in movie recommendation systems, in which each user of a given 
system rates a few movies out of tens of thousands of available titles. 
The system subsequently predicts a tentative user rating for all possible 
movies, and ultimately displays the ones that the user might like. Unlike 
traditional missing value imputation problems where a few entries in a 
given data matrix are missing, in the context of matrix completion, one 
works with mostly missing entries (e.g., in movie recommendation sys- 
tems, 99% or more of the matrix is typically missing). Low-rank matrix 
factorization methods have been shown to be quite successful in such 
matrix completion problems [48, 73]. Further explorations of matrix com- 
pletion methods for the paleoclimate reconstruction problem therefore are 
fully warranted. This includes investigations into the applicability of exist- 
ing methods, such as probabilistic matrix factorization [73] or low-rank 
and sparse decompositions [114], as well as explorations of new methods 
that take into account aspects specific to the paleoclimate reconstruction. 
Methods that can perform completions along with a confidence score are 
more desirable because uncertainty quantification is an important desid- 
eratum for paleoclimate. 
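
To make the matrix completion idea concrete, the following is a minimal sketch of low-rank completion by alternating regularized least squares. It is a generic illustration, not the probabilistic matrix factorization of [73] or any published CFR method; the function name `complete_low_rank`, the synthetic rank-2 field, and all parameter values are assumptions chosen for the example.

```python
import numpy as np


def complete_low_rank(X, mask, rank=2, reg=1e-3, n_iters=100, seed=0):
    """Estimate a full matrix from its observed entries (mask == True)
    by fitting X ~ U @ V.T with alternating ridge-regularized least squares."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    U = rng.normal(scale=0.1, size=(n_rows, rank))
    V = rng.normal(scale=0.1, size=(n_cols, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Update each row factor from the observed entries in that row.
        for i in range(n_rows):
            obs = mask[i]
            if obs.any():
                Vo = V[obs]
                U[i] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ X[i, obs])
        # Update each column factor from the observed entries in that column.
        for j in range(n_cols):
            obs = mask[:, j]
            if obs.any():
                Uo = U[obs]
                V[j] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ X[obs, j])
    return U @ V.T


# Synthetic stand-in for a (time x location) field of exact rank 2.
rng = np.random.default_rng(1)
truth = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 30))
mask = rng.random(truth.shape) < 0.5        # True where a value is observed
observed = np.where(mask, truth, 0.0)       # unobserved entries are never read
estimate = complete_low_rank(observed, mask)
rmse_missing = float(np.sqrt(np.mean((estimate - truth)[~mask] ** 2)))
```

A sketch like this returns point estimates only. The confidence scores called for above would require a probabilistic or ensemble extension, and real proxy data are neither exactly low rank nor missing at random.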

Finally, it is important to return to the fact that extensive method- 
ological work in the field of CE paleoclimatology is aimed, in part, at 
better constraining natural climate variability on decadal-to-centennial 
timescales. This timescale of variability, in addition to expected forced 
changes, will be the other key contribution to observed climate during the 
21st century. Whether we are seeking improved decadal predictions [93] 
or refined projections of 21st century regional climate impacts [28], these
estimates must incorporate estimates of both forced and natural variabil- 
ity. It therefore is imperative that we fully understand how the climate 
naturally varies across a range of relevant timescales, how it changes when 
forced, and how these two components of change may couple together. 
This understanding cannot be achieved from the modern instrumental 
record alone, and the CE is a strategic paleoclimate target because it pro- 
vides both reconstructions with high temporal and spatial resolution and 
an interval over which CGCM simulations are also feasible. Combining 
these two sources of information to assess model projections of future
climate is therefore itself an important future area of discovery. Analyses
that incorporate both the uncertainties in paleoclimatic estimates and the
ensemble results of multiple model simulations will be essential for these
assessments and are likely to be a key component of climate informatics as
the field evolves.

4.9 APPLICATIONS TO PROBLEMS IN POLAR REGIONS 

Another potential application of machine learning concerns the impact 
of climate change at the poles and the interaction between the poles and 
climate in general. Because collecting data from polar regions is
difficult, costly, and logistically demanding, it is important to
maximize the benefit derived from the data. The paucity of
surface-measured data is complemented by the richness and increasing
volume of both satellite/airborne data and model outputs. In this regard,
powerful tools are needed — not only to analyze, manipulate, and visual- 
ize large datasets, but also to search and discover new information from 
different sources — in order to exploit relationships between data and pro- 
cesses that are not evident or captured by physical models. 

The number of applications of machine learning to study polar regions 
is not high although it has been increasing over the past decade. This is 
especially true in those cases when data collected from spaceborne sensors 
are considered. For example, Tedesco and colleagues [98, 99] use artificial 
neural networks (ANNs) or genetic algorithms to estimate snow param- 
eters from spaceborne microwave observations. Soh and Tsatsoulis [91] 
use an Automated Sea Ice Segmentation (ASIS) system that automatically 
segments Synthetic Aperture Radar (SAR) sea ice imagery by integrating 
image processing, data mining, and machine learning methodologies. The 
system is further developed by Soh et al. [92], where an intelligent sys- 
tem for satellite sea ice image analysis called Advanced Reasoning using 
Knowledge for Typing Of Sea ice (ARKTOS) “mimicking the reason- 
ing process of sea ice experts” is presented. Tu and Teen [54] use semi- 
supervised learning to separate snow and non-snow areas over Greenland 
using a multispectral approach. Reusch [71] applies tools from the field
of ANNs to reconstruct centennial-scale records of West Antarctic sea-ice
variability using ice-core datasets from 18 West Antarctic sites and
satellite-based records of sea ice. ANNs are used as a nonlinear tool to
relate ice-core predictors to sea-ice targets (for example, sea salt
chemistry to sea ice edge). One of the results from this study is that, in
general, reconstructions are quite sensitive to the predictor used, and not
all predictors appear to be useful. Finally, Gilford [27] presents a
detailed study of team learning, collaboration, and decision applied to
ice-penetrating radar data collected in Greenland in May 1999 and September
2007 as part of a model-creation effort for subglacial water presence
classification.

The above-mentioned examples represent a few cases where machine 
learning tools have been applied to problems focusing on studying the 
polar regions. Although the number of studies appears to be increasing, 
likely because of both the increased research focusing on climate change 
and the poles and the increased computational power allowing machine 
learning tools to expand in their usage, they are still relatively rare com- 
pared to simpler but often less efficient techniques. 

Machine learning and data mining can be used to enhance the value 
of the data by exposing information that would not be apparent from 
single-dataset analyses. For example, identifying the link between dimin- 
ishing sea ice extent and increasing melting in Greenland can be done 
through physical models attempting to model the connections between 
the two through the exchange of atmospheric fluxes. However, large-scale 
connections (or others at different temporal and spatial scales) might be 
revealed through the use of data-driven models or, in a more sophisticated 
fashion, through the combination of both physical and data-driven mod- 
els. Such an approach would, among other things, overcome the limitation 
of the physical models that, even if they represent the state-of-the-art in 
the corresponding fields, are limited by our knowledge and understanding 
of the physical processes. ANNs can also be used to understand not
only the connections among multiple parameters (through analysis of
the neuron connections), but also potential temporal shifts
in the importance of parameters on the overall process (e.g., increased 
importance of albedo due to the exposure of bare ice and reduced solid 
precipitation in Greenland over the past few years). Applications are not 
limited to a pure scientific analysis but also include the management of 
information, error analysis, missing linkages between databases, and 
improving data acquisition procedures. 

In synthesis, there are many areas in which machine learning can 
support studies of the poles within the context of climate and climate 
change. These include climate model parameterizations and multimodel 
ensembles of projections for variables such as sea ice extent, melting in 
Greenland, and sea-level rise contribution, in addition to those discussed 
in previous sections. 

4.10 TOWARD A CLIMATE INFORMATICS TOOLBOX

Recent additions to the toolbox of modern machine learning have consid- 
erable potential to contribute to and greatly improve the prediction and 
inference capability for climate science. Climate prediction has significant 
challenges, including high dimensionality, multiscale behavior, uncer- 
tainty, and strong nonlinearity, and also benefits from having historical 
data and physics-based models. It is imperative that we bring all available, 
relevant tools to bear on the climate arena. In addition to the methods 
cited in Section 4.2 and in subsequent sections, here we briefly describe 
several other methods (some proposed recently) that one might consider
applying to problems in climate science.

We begin with Caltech and Los Alamos National Laboratory’s recently
developed Optimal Uncertainty Quantification (OUQ) formalism [67, 79]. 
OUQ is a rigorous, yet practical, approach to uncertainty quantification 
that provides optimal bounds on uncertainties for a given, stated set of 
assumptions. For example, OUQ can provide a guarantee that the probability
that a physical variable exceeds a cutoff is less than some value ε. This
method has been successfully applied to assess the safety of truss struc- 
tures to seismic activity. In particular, OUQ can provide the maximum 
and minimum values of the probability of failure of a structure as a func- 
tion of earthquake magnitude. These probabilities are calculated by solv- 
ing an optimization problem that is determined by the assumptions in the 
problem. As input, OUQ requires a detailed specification of assumptions. 
One form of assumption may be (historical) data. The method’s poten- 
tial for practical use resides in a reduction from an infinite-dimensional, 
nonconvex optimization problem to a finite- (typically low) dimensional 
one. For a given set of assumptions, the OUQ method returns one of three 
answers: (1) Yes, the structure will withstand the earthquake with probability
greater than p; (2) No, it will not withstand it with probability p; or
(3) Given the input, one cannot conclude either (i.e., undetermined). In the 
undetermined case, more/different data/assumptions are then required to 
say something definite. Climate models are typically infinite-dimensional 
dynamical systems, and a given set of assumptions will reduce this to a 
finite-dimensional problem. The OUQ approach could address such ques- 
tions as whether (given a potential scenario) the global mean temperature 
increase will exceed some threshold T, with some probability ε.
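
As a minimal illustration of what an optimal bound under stated assumptions means (this example is ours, not drawn from [67, 79]), suppose the only assumptions are that a quantity X lies in [0, M] and that its mean is at most m. For a cutoff a with 0 < a ≤ M, the largest exceedance probability consistent with those assumptions is the Markov bound, which is attained and therefore cannot be tightened without adding assumptions:

```latex
\sup_{\mu \in \mathcal{A}} \mu[X \ge a] = \min\!\left(1, \frac{m}{a}\right),
\qquad
\mathcal{A} = \bigl\{ \mu \text{ on } [0, M] : \mathbb{E}_{\mu}[X] \le m \bigr\}.
```

When m ≤ a, the supremum is reached by the measure that places mass m/a at a and the remainder at 0. Realistic OUQ problems involve far richer assumption sets, but the structure is the same: an optimization over all scenarios compatible with what is assumed.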

To improve the performance (e.g., reduce the generalization error) in 
statistical learning problems, it sometimes helps to incorporate domain 
knowledge. This approach is particularly beneficial when there is limited
data from which to learn, as is often the case in high-dimensional problems 
(genomics is another example). This general philosophy is described in a 
number of approaches such as learning with side information, Universum 
Learning [84], and learning from non-examples [83]. Learning with the
Universum and learning from non-examples involve augmenting the 
available data with related examples from the same problem domain, but 
not necessarily from the same distribution. Quite often, the generalization 
error for predictions can be shown to be smaller for carefully chosen aug- 
mented data, but this is a relatively uncharted field of research and it is not 
yet known how to use this optimally. One can imagine using an ensemble 
of climate models in conjunction with data from model simulations to 
improve predictive capacity. How to optimally select Universum or non- 
examples is an open problem. 

Domain knowledge in the form of competing models provides the basis 
of a game-theoretic approach to model selection [11]. This relates to recent
work in applying algorithms for online learning with experts to combin- 
ing the predictions of the multimodel ensemble of GCMs [63]. On his- 
torical data, this online learning algorithm’s average prediction loss nearly 
matched that of the best performing climate model. Moreover, the perfor- 
mance of the algorithm surpassed that of the average model prediction, 
which is a common state-of-the-art method in climate science. A major 
advantage of these approaches, as well as game-theoretic formulations, 
is their robustness, including the lack of assumptions regarding linearity 
and noise. However, because future observations are missing, algorithms 
for unsupervised or semi-supervised learning with experts should be 
developed and explored. 
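
The general idea of learning with experts can be sketched with the standard exponentially weighted average forecaster. This is a textbook algorithm used here for illustration and is not necessarily the algorithm of [63]; the synthetic "models" with fixed biases, the learning rate `eta`, and the function name are all assumptions.

```python
import numpy as np


def exp_weighted_forecast(expert_preds, observations, eta=2.0):
    """expert_preds has shape (T, K); observations has shape (T,).
    Returns the learner's predictions and the final expert weights."""
    T, K = expert_preds.shape
    log_w = np.zeros(K)                       # start from uniform weights
    learner = np.empty(T)
    for t in range(T):
        w = np.exp(log_w - log_w.max())       # subtract max for stability
        w /= w.sum()
        learner[t] = w @ expert_preds[t]      # predict before seeing the truth
        losses = (expert_preds[t] - observations[t]) ** 2
        log_w -= eta * losses                 # downweight experts that erred
    w = np.exp(log_w - log_w.max())
    return learner, w / w.sum()


# Synthetic experts: the observed series plus a fixed bias and small noise.
rng = np.random.default_rng(0)
T = 200
obs = np.sin(np.linspace(0, 6, T))
biases = np.array([-1.0, -0.3, 0.1, 0.8, 1.5])   # hypothetical model biases
experts = obs[:, None] + biases + rng.normal(scale=0.1, size=(T, biases.size))

learner, weights = exp_weighted_forecast(experts, obs)
loss_learner = float(np.mean((learner - obs) ** 2))
loss_average = float(np.mean((experts.mean(axis=1) - obs) ** 2))
loss_best = float(np.min(np.mean((experts - obs[:, None]) ** 2, axis=0)))
```

Comparing `loss_learner` with `loss_average` and `loss_best` mirrors the comparison described above. Note that the weight update needs the observation at each step, which is exactly what is unavailable for future projections.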

Conformal prediction is a recently developed framework for learning 
based on the theory of algorithmic randomness. The strength of confor- 
mal prediction is that it allows one to quantify the confidence in a predic- 
tion [80]. Moreover, the reliability of the prediction is never overestimated. 
This is, of course, very important in climate prediction. To apply the suite 
of tools from conformal prediction, however, one needs to have iid (inde- 
pendent, identically distributed) or exchangeable data. While this is a 
serious restriction, one can imagine using iid computer simulations and 
checking for robustness. Conformal prediction is fairly easy to use and 
can be implemented as a simple wrapper to existing classifiers or regres- 
sion algorithms. Conformal prediction has been applied successfully in 
genomics and medical diagnoses. It is likely worthwhile to apply conformal
prediction to other complex problems in computational science.
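
The "simple wrapper" character of conformal prediction can be sketched with the split (inductive) variant for regression. This is one common variant, shown under the iid assumption noted above; the least-squares base learner, the function names, and the synthetic data are assumptions made for the example.

```python
import numpy as np


def split_conformal(fit, predict, X_fit, y_fit, X_cal, y_cal, X_new, alpha=0.1):
    """Wrap any regressor (passed as fit/predict callables) with split
    conformal intervals that target coverage of at least 1 - alpha."""
    model = fit(X_fit, y_fit)
    # Nonconformity scores: absolute residuals on held-out calibration data.
    scores = np.sort(np.abs(y_cal - predict(model, X_cal)))
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))   # finite-sample corrected rank
    q = scores[k - 1] if k <= n else np.inf
    center = predict(model, X_new)
    return center - q, center + q


# Hypothetical base learner: ordinary least squares on a single predictor.
def fit_ols(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_ols(coef, X):
    return coef[0] + coef[1] * X


rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=3000)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)   # iid synthetic data
lo, hi = split_conformal(fit_ols, predict_ols,
                         x[:500], y[:500], x[500:1000], y[500:1000], x[1000:])
coverage = float(np.mean((y[1000:] >= lo) & (y[1000:] <= hi)))
```

The wrapper never inspects the base learner, so `fit_ols` and `predict_ols` could be swapped for any regression method. The coverage guarantee rests on exchangeability, which autocorrelated climate series generally violate.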

Statistical Relational Learning [26] offers a natural framework for infer- 
ence in climate. Included within this set of methods are graphical models 
[47], a flexible and powerful formalism with which to carry out inference for 
large, highly complex systems (some of which were discussed in Sections 4.5 
and 4.6). At one extreme, graphical models can be derived solely from data. 
At the other extreme, graphical models provide a generalization of Kalman 
filters or smoothers, where data are integrated with a model. This general 
approach is quite powerful but requires efficient computation of conditional 
probabilities. As a result, one might explore how to adapt or extend the cur- 
rent suite of belief propagation methods to climate-specific problems. 

Finally, for all of the above methods, it would be helpful if the learn- 
ing algorithm could automatically determine which information or data it 
would be useful to get next. The “optimal learning” formalism addresses 
this question [69]. This gradient learning approach can be applied to a whole 
host of problems for learning where one has limited resources to allocate 
for information gathering. Optimal learning has been applied successfully 
to experiment design, in particular in the pharmaceutical industry, where 
it has the potential to reduce the cost (financial, time, etc.) of the drug 
discovery process. Optimal learning might be applied to climate science 
in order to guide the next set of observations and/or the next simulations. 

To conclude, there is a suite of recently developed machine learning 
methods whose applicability and usefulness in climate science should be 
explored. At this point, we have only begun to scratch the surface. If these 
methods prove successful in climate studies, we would expect them to 
apply elsewhere — where one has a model of the physical system and can 
access data. 

4.11 DATA CHALLENGES AND OPPORTUNITIES IN CLIMATE INFORMATICS

Here we discuss additional challenges and important issues in analyzing 
climate data. 

4.11.1 Issues with Cross-Class Comparisons 

There is often a need to compare across different classes of data, whether to 
provide ground truth for a satellite retrieval or to evaluate a climate model 
prediction or to calibrate a proxy measurement. But because of the different 
characteristics of the data, comparing “apples to apples” can be difficult. 

One of the recurring issues is the difference between internal variability
(or weather) and climate responses tied to a specific external forcing.
The internal variability is a function of the chaotic dynamics in the atmo- 
sphere and cannot be predicted over time periods longer than 10 days or 
so (see Section 4.6). This variability, which can exist on all timescales, 
exists also in climate models; but because of the sensitive dependence on 
initial conditions, any unique simulation will have a different realization 
of the internal variability. Climate changes are then effectively defined as 
the ensemble mean response (i.e., after averaging out any internal vari- 
ability). Thus, any single realization (such as the real-world record) must 
be thought of as a forced signal (driven by external drivers) combined with 
a stochastic weather component. 
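
The decomposition into a forced signal plus a stochastic component can be illustrated with a toy ensemble. The linear trend, the noise level, and the use of white noise for internal variability are simplifying assumptions; real internal variability is correlated in time and space.

```python
import numpy as np


def simulate_ensemble(forced, n_members, noise_std, seed=0):
    """Each member shares the forced signal but has its own realization
    of internal variability (modeled here as independent white noise)."""
    rng = np.random.default_rng(seed)
    internal = rng.normal(scale=noise_std, size=(n_members, forced.size))
    return forced + internal


def rms(x):
    return float(np.sqrt(np.mean(np.square(x))))


years = np.arange(100)
forced = 0.02 * years                      # hypothetical forced trend
ensemble = simulate_ensemble(forced, n_members=50, noise_std=0.3)

ensemble_mean = ensemble.mean(axis=0)      # averages out internal variability
err_single = rms(ensemble[0] - forced)     # one realization, like the real world
err_mean = rms(ensemble_mean - forced)     # shrinks roughly as 1/sqrt(n_members)
```

The single member plays the role of the observed record: it contains the same forced signal as the ensemble mean but cannot be expected to match it year by year.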

The internal variability increases in relative magnitude as a function 
of decreasing time or spatial scale. Thus, comparisons of the specific 
time evolution of the climate system need to either take the variability 
into account or use specific techniques to minimize the difference from 
the real world. For instance, “nudged” simulations use observed winds 
from the reanalyses to keep the weather in the model loosely tied to the 
observations. Simulations using the observed ocean temperatures as a 
boundary condition can do a good job of synchronizing the impacts of 
variability in the ocean on the atmospheric fields. Another way to mini- 
mize the impact of internal variability is to look for property-to-property 
correlations to focus on specific processes that, although they may occur 
at different points in time or space, can nonetheless be compared across 
models and observations. 

Another issue is that model output does not necessarily represent exact 
topography or conditions related to an in-situ observation. The aver- 
age height of a specific grid box might not correspond to the height of a 
mountain-based observing platform, or the resolved shape of the coastline 
might make a difference of 200 kilometers or so in the distance of a station 
to the shore. These issues can be alleviated to some extent if comparisons 
focus on large-scale gridded data. Another technique is to “downscale” the 
model output to specific locations, either statistically (based on observed 
correlations of a local record to larger-scale features of the circulation) or
dynamically (using an embedded RCM). These methods have the poten- 
tial to correct for biases in the large-scale model, but many practical issues 
remain in assessing by how much. 

Finally, observations are a function of a specific observing methodol- 
ogy that encompasses technology, practice, and opportunity. These factors 
can impart a bias or skewness to the observation relative to what the real
world may nominally be doing. Examples in satellite remote sensing are 
common — for example, a low cloud record from a satellite will only be 
able to see low clouds when there are no high clouds. Similarly, a satellite 
record of “mid-tropospheric” temperatures might actually be a weighted 
integral of temperatures from the surface to the stratosphere. A
paleoclimate record may be of a quantity that, while related to temperature
or precipitation, is a complex function of both, weighted toward a specific
season. In all these cases, it is often advisable to create a ‘forward
model’ of the observational process itself to post-process the raw simula- 
tion output to create more commensurate diagnostics. 
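
For the satellite temperature case, a forward model can be as simple as applying a vertical weighting function to the simulated temperature profile. The sketch below assumes a hypothetical five-level profile and made-up weights; actual weighting functions are instrument specific.

```python
import numpy as np


def satellite_forward_model(temps, weights):
    """Map simulated temperatures on model levels (last axis of temps)
    to the single weighted-average value a sounder would report."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()      # normalize the weighting function
    return np.asarray(temps, dtype=float) @ weights


# Hypothetical profile from the surface upward (kelvin) and made-up weights.
profile = [288.0, 270.0, 250.0, 220.0, 215.0]
weighting = [0.10, 0.30, 0.40, 0.15, 0.05]
simulated_obs = float(satellite_forward_model(profile, weighting))
```

The quantity `simulated_obs`, rather than the model temperature at any one level, is what should be compared with the satellite record.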

4.11.2 Climate System Complexity 

A further issue arises in creating statistical models of the climate system 
because both the real world and dynamical models have a large number of 
different physical variables. 

Even simplified models can have hundreds of variables, and while not 
all of them are essential to determining the state of the system, one var- 
iable is frequently not sufficient. Land, atmosphere, and ocean processes 
all have different dominant timescales, and thus different components are 
essential at different scales. Some physical understanding is thus neces- 
sary to make the proper variable/data choices, even with analysis schemes 
that extract structure from large datasets. Furthermore, these systems are 
chaotic, that is, initial conditions that are practically indistinguishable 
from each other in any given observing system will diverge greatly from 
each other on some short timescale. Thus, extracting useful predictions 
requires more than creating more accurate models — one needs to deter- 
mine which aspects are predictable and which are not. 

4.11.3 Challenge: Cloud-Computing-Based Reproducible 
Climate Data Analysis 

The study of science requires reproducible results: science is a body of 
work where the community strives to ensure that results are not from 
the unique abilities and circumstances of one particular person or group. 
Traditionally, this has been done in large part by publishing papers, 
but the scale of modern climate modeling and data analysis efforts has far 
outstripped the ability of a journal article to convey enough information 
to allow reproducibility. This is an issue both of size and of complexity: 
model results are much larger than can be conveyed in a few pages, and 
both models and analysis procedures are too complex to be adequately
described in a few pages. 

The sheer size of GCMs and satellite datasets is also outstripping our 
traditional data storage and distribution methods: frequently, only a few 
variables from a model’s output are saved and distributed at high reso- 
lution, and the remaining model output is heavily averaged to generate 
datasets that are sufficiently small. 

One promising approach to addressing these problems is cloud- 
computing-based reproducible climate data analysis. Having both the data 
and the analyses resident in the computational cloud allows the details of 
the computation to be hidden from the user; so, for example, data-intensive 
portions of the computation could be executed close to where the data 
resides. But these analyses must be reproducible, which brings not only 
technical challenges of archiving and finding, describing, and publishing 
analysis procedures, but also institutional challenges of ensuring that the 
large datasets that form the basis of these analyses remain accessible. 

4.11.3.1 Data Scale 

The size of datasets is rapidly outstripping the ability to store and serve the 
data. We have difficulty storing even a single copy of the complete archive 
of the CMIP3 model results, and making complete copies of those results 
and distributing them for analysis is a large undertaking that also
limits the analysis to the few places that have data storage facilities of
that scale. Analysis done by the host prior to distribution, such as averag- 
ing, reduces the size to something more manageable, but currently those 
reductions are chosen far in advance, and there are many other useful 
analyses that are not currently being done. 

A cloud-based analysis framework would allow such reductions to be
chosen on demand and still executed on machines with fast access to the data.

4.11.3.2 Reproducibility and Provenance Graphs 

A cloud-based analysis framework would have to generate reproducible 
documented results; that is, we would not only need the ability to rerun a 
calculation and know that it would generate the same results, but also know 
precisely what analysis had been done. This could be achieved, in part, by
having standardized analysis schemes, so that one could be sure precisely
what was calculated in a given data filter. It is also important to
systematically track the full provenance of the calculation. This provenance
graph, showing the full network of data filters and initial, intermediate,
and final results, would provide the basis of both reproducibility and
communication of results. Provenance graphs provide the information
necessary to rerun a calculation and get the same results; they also provide the
basis of the full documentation of the results. This full network would 
need to have layers of abstraction so that the user could start with an over- 
all picture and then proceed to more detailed versions as needed. 
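
A provenance graph of this kind can be sketched in a few lines. The class below is a hypothetical toy, not an existing framework: it records each data item and each filter application as a node keyed by a content hash, so repeating the same computation yields the same node identifiers. The dataset name `tas_monthly` and its values are made up.

```python
import hashlib
import json


class ProvenanceGraph:
    """Stores data items and filter applications as hash-keyed nodes."""

    def __init__(self):
        self.nodes = {}

    @staticmethod
    def _digest(record):
        text = json.dumps(record, sort_keys=True)
        return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

    def add_data(self, name, value):
        node_id = self._digest({"kind": "data", "name": name, "value": value})
        self.nodes[node_id] = {"name": name, "inputs": [], "params": {},
                               "value": value}
        return node_id

    def apply(self, name, func, input_ids, **params):
        # Run the filter on its inputs and record exactly what was used.
        args = [self.nodes[i]["value"] for i in input_ids]
        value = func(*args, **params)
        node_id = self._digest({"kind": "filter", "name": name,
                                "inputs": list(input_ids), "params": params})
        self.nodes[node_id] = {"name": name, "inputs": list(input_ids),
                               "params": params, "value": value}
        return node_id

    def lineage(self, node_id, _seen=None):
        """List the names of all upstream nodes in dependency order."""
        seen = set() if _seen is None else _seen
        if node_id in seen:
            return []
        seen.add(node_id)
        steps = []
        for parent in self.nodes[node_id]["inputs"]:
            steps += self.lineage(parent, seen)
        return steps + [self.nodes[node_id]["name"]]


def build_example():
    graph = ProvenanceGraph()
    raw = graph.add_data("tas_monthly", [1.0, 2.0, 3.0, 4.0])  # made-up values
    mean = graph.apply("time_mean", lambda xs: sum(xs) / len(xs), [raw])
    anomaly = graph.apply("subtract", lambda xs, m: [x - m for x in xs],
                          [raw, mean])
    return graph, anomaly


graph, anomaly_id = build_example()
history = graph.lineage(anomaly_id)
```

In this sketch a filter is identified only by its name and parameters, which is why standardized, versioned analysis schemes matter: the name must pin down exactly what was computed. The layers of abstraction mentioned above would correspond to collapsing subgraphs into single nodes.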

4.12 CONCLUSION 

The goal of this chapter is to inspire future work in the nascent field of 
climate informatics. We hope to encourage work not only on some of the 
challenge problems proposed here, but also on new problems. A profuse 
amount of climate data of various types is available, providing a rich and 
fertile playground for future machine learning and data mining research. 
Even exploratory data analysis could prove useful for accelerating discov- 
ery. To that end, we have prepared a climate informatics wiki as a result of 
the First International Workshop on Climate Informatics, which includes 
climate data links with descriptions, challenge problems, and tutorials on 
machine learning techniques [14]. We are confident that there are myriad 
collaborations possible at the intersection of climate science and machine 
learning, data mining, and statistics. We hope our work will encourage 
progress on a range of emerging problems in climate informatics. 